@samkul-swe (Contributor) commented Nov 23, 2025

Description

Added a new Webcrawler connector that lets users crawl and index web pages into their search spaces. The connector supports Firecrawl and, as a free fallback, AsyncChromiumLoader.

Key Changes:

  • Created WebCrawlerConnector class for web page crawling
  • Implemented index_crawled_urls indexer with date range support
  • Added search_webcrawler function for retrieving indexed web content
  • Built UI components for connector configuration and management
  • Added validation and error handling for webcrawler operations
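
For reviewers, the crawler selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `crawl_url`, `CrawlResult`, and the two fetcher callables are hypothetical names, and the real Firecrawl and AsyncChromiumLoader calls are stubbed out as plain callables so the selection logic stands on its own. Falling back to the free loader when Firecrawl raises is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CrawlResult:
    url: str
    content: str
    crawler: str  # which backend produced the content


# Hypothetical fetcher signature: takes a URL, returns the page text.
Fetcher = Callable[[str], str]


def crawl_url(
    url: str,
    firecrawl_fetch: Fetcher,
    chromium_fetch: Fetcher,
    firecrawl_api_key: Optional[str] = None,
) -> CrawlResult:
    """Use Firecrawl when an API key is configured, else the free fallback."""
    if firecrawl_api_key:
        try:
            return CrawlResult(url, firecrawl_fetch(url), "firecrawl")
        except Exception:
            # Assumption: a Firecrawl failure falls through to the free loader.
            pass
    return CrawlResult(url, chromium_fetch(url), "chromium")
```

Passing the fetchers in as arguments keeps the sketch independent of either library and makes the fallback path easy to exercise with stubs.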

Motivation and Context

Users need the ability to index external web content into their search spaces for comprehensive knowledge retrieval. This connector enables:

  • Crawling documentation sites, blogs, and web resources
  • Optional Firecrawl integration for enhanced crawling
  • Free fallback option using AsyncChromiumLoader
  • Automatic content change detection and updates
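
One common way to implement the content change detection mentioned above is to compare a hash of the crawled text against the hash stored at the last index run. The sketch below assumes that approach; `content_hash`, `needs_reindex`, and the in-memory `stored_hashes` dict are hypothetical stand-ins, and the PR's indexer may detect changes differently.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """Hash page text after collapsing whitespace so trivial reflows are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reindex(url: str, text: str, stored_hashes: Dict[str, str]) -> bool:
    """Return True if the URL is new or its content changed; record the new hash."""
    new_hash = content_hash(text)
    if stored_hashes.get(url) == new_hash:
        return False
    stored_hashes[url] = new_hash
    return True
```

In a real indexer the hash would presumably live alongside the stored document rather than in a dict, so unchanged pages can be skipped without re-embedding.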

FIX #468

Screenshots

(Four screenshots dated 2025-11-22; images not reproduced here.)

API Changes

  • This PR includes API changes

New Endpoints/Routes:

  • Webcrawler connector creation and configuration
  • Webcrawler indexing endpoint with URL list support
  • Search endpoint returns WEBCRAWLER_CONNECTOR type results

Database Changes:

  • Added WEBCRAWLER_CONNECTOR to SearchSourceConnectorType enum
  • Documents stored with DocumentType.CRAWLED_URL
  • Config fields: FIRECRAWL_API_KEY (optional), INITIAL_URLS (optional)
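
Since both config fields are optional, validation only needs to check what is present. The following is a rough sketch under stated assumptions: `validate_webcrawler_config` and `parse_initial_urls` are hypothetical names, and treating `INITIAL_URLS` as a newline-separated string is a guess, since the PR description does not state its format.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def parse_initial_urls(raw: str) -> List[str]:
    """Split a newline-separated string into validated http(s) URLs."""
    urls: List[str] = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Invalid URL: {candidate}")
        urls.append(candidate)
    return urls


def validate_webcrawler_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Both fields are optional; validate only the ones that are set."""
    cleaned: Dict[str, Any] = {}
    api_key = config.get("FIRECRAWL_API_KEY")
    if api_key:
        cleaned["FIRECRAWL_API_KEY"] = str(api_key).strip()
    initial = config.get("INITIAL_URLS")
    if initial:
        cleaned["INITIAL_URLS"] = parse_initial_urls(str(initial))
    return cleaned
```

An empty config is accepted as-is, which matches the description of a connector that works without a Firecrawl key or any seed URLs.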

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed (firecrawl, langchain_community)
  • No lint/build errors or new warnings
  • All relevant tests are passing

vercel bot commented Nov 23, 2025

@samkul-swe is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

@recurseml recurseml bot left a comment

Review by RecurseML

🔍 Review performed on 70f3381..ad75f81

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (31)

surfsense_backend/alembic/versions/38_add_webcrawler_connector_enum.py
surfsense_backend/app/agents/researcher/qna_agent/default_prompts.py
surfsense_backend/app/agents/researcher/utils.py
surfsense_backend/app/config/__init__.py
surfsense_backend/app/connectors/webcrawler_connector.py
surfsense_backend/app/db.py
surfsense_backend/app/routes/documents_routes.py
surfsense_backend/app/routes/search_source_connectors_routes.py
surfsense_backend/app/services/connector_service.py
surfsense_backend/app/tasks/celery_tasks/connector_tasks.py
surfsense_backend/app/tasks/celery_tasks/document_tasks.py
surfsense_backend/app/tasks/celery_tasks/schedule_checker_task.py
surfsense_backend/app/tasks/connector_indexers/__init__.py
surfsense_backend/app/tasks/connector_indexers/webcrawler_indexer.py
surfsense_backend/app/tasks/document_processors/__init__.py
surfsense_backend/app/tasks/document_processors/url_crawler.py
surfsense_backend/app/utils/periodic_scheduler.py
surfsense_backend/app/utils/validators.py
surfsense_web/app/dashboard/[search_space_id]/connectors/[connector_id]/edit/page.tsx
surfsense_web/app/dashboard/[search_space_id]/connectors/[connector_id]/page.tsx
surfsense_web/app/dashboard/[search_space_id]/connectors/add/webcrawler-connector/page.tsx
surfsense_web/app/dashboard/[search_space_id]/documents/webpage/page.tsx
surfsense_web/components/dashboard-breadcrumb.tsx
surfsense_web/components/editConnector/types.ts
surfsense_web/components/homepage/integrations.tsx
surfsense_web/components/sources/connector-data.tsx
surfsense_web/content/docs/docker-installation.mdx
surfsense_web/contracts/enums/connector.ts
surfsense_web/contracts/enums/connectorIcons.tsx
surfsense_web/lib/connectors/utils.ts
surfsense_web/messages/en.json

@recurseml recurseml bot left a comment

Review by RecurseML

🔍 Review performed on ad75f81..ebea98c

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (3)

surfsense_backend/app/agents/researcher/nodes.py
surfsense_backend/app/utils/validators.py
surfsense_web/hooks/use-connector-edit-page.ts
